FEAT: Add ArabiziConverter for Arabic transliteration #1906

Merged
romanlutz merged 2 commits into microsoft:main from Raulster24:raulster24/add-arabizi-converter
Jun 4, 2026

@Raulster24
Contributor

Description

Adds ArabiziConverter, a deterministic PromptConverter that transliterates Arabic script into Arabizi (Latin-script "chat Arabic"), where letters with no Latin equivalent are written with shape-resembling digits (HAH -> 7, AIN -> 3, QAF -> 8). It applies a per-character mapping with Gulf-leaning conventions; no language model is involved, so the same input always produces the same output. Short-vowel diacritics and the tatweel connector are dropped, and non-Arabic text (Latin, digits, punctuation) is left unchanged. The mapping is intentionally lossy, mirroring how Arabizi is actually written.

The mapping follows the documented Arabic chat alphabet (Gulf-leaning where regional variants exist, e.g. QAF -> 8 with GHAIN -> gh to avoid the regional 8 collision). Feedback on specific letter choices is welcome.
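
For readers who want to see the shape of the approach, here is a minimal standalone sketch of a per-character transliteration pass. It is not the code merged in this PR: the function name `to_arabizi`, the table name `ARABIZI_MAP`, and every mapping entry other than HAH, AIN, QAF, and GHAIN are assumptions made for illustration, and leaving unmapped Arabic letters unchanged is likewise an assumed fallback.

```python
# Hypothetical sketch of a deterministic Arabic -> Arabizi pass.
# Only HAH, AIN, QAF and GHAIN are taken from the PR description;
# the other entries are assumed so the example can run end to end.
ARABIZI_MAP = {
    "\u062D": "7",   # HAH   (from the description)
    "\u0639": "3",   # AIN   (from the description)
    "\u0642": "8",   # QAF   (from the description)
    "\u063A": "gh",  # GHAIN (from the description, multi-character output)
    "\u0645": "m",   # MEEM  (assumed)
    "\u0631": "r",   # REH   (assumed)
    "\u0628": "b",   # BEH   (assumed)
    "\u0627": "a",   # ALEF  (assumed)
}

TATWEEL = "\u0640"
# Fathatan (U+064B) through sukun (U+0652): the short-vowel diacritics.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def to_arabizi(text: str) -> str:
    """Transliterate character by character; drop diacritics and tatweel."""
    pieces = []
    for ch in text:
        if ch == TATWEEL or ch in DIACRITICS:
            continue  # intentionally lossy: these characters are dropped
        # Characters without a mapping (Latin, digits, punctuation) pass through.
        pieces.append(ARABIZI_MAP.get(ch, ch))
    return "".join(pieces)


if __name__ == "__main__":
    # MEEM, REH, HAH, BEH, ALEF
    print(to_arabizi("\u0645\u0631\u062D\u0628\u0627"))
```

Because the pass is a plain dictionary lookup with no model call, repeated calls on the same input return the same string, which is the determinism property the description refers to.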

Fourth in the set of atomic Arabic-script converters, following BidiConverter (#1832), TatweelConverter (#1869), and ArabicPresentationFormConverter (#1888). It can later migrate to a shared CharacterSubstitutionConverter base alongside UnicodeConfusableConverter.

cc @romanlutz

Tests and Documentation

  • Added tests/unit/prompt_converter/test_arabizi_converter.py: word transliteration, number-letters, multi-character mappings, dropped diacritics/tatweel, non-Arabic passthrough, mixed text, empty input, determinism, and unsupported-input-type rejection. All pass: uv run pytest tests/unit/prompt_converter/test_arabizi_converter.py
  • Registered in pyrit/prompt_converter/__init__.py (import + __all__).
  • Added a usage example to doc/code/converters/1_text_to_text_converters.py and regenerated the paired .ipynb plus the converter modality table in 0_converters.ipynb via JupyText.
  • ruff and ty are clean; the converter-documentation conformance test passes.

@romanlutz
Contributor

FYI @Raulster24, technically we already have character-level converters! Check LeetspeakConverter or EmojiConverter, which have a map from letters to numbers/emojis. Arabizi seems to follow the same pattern. I don't see much potential for generalizing with a common base class, since that really comes down to a single line, but it's a good thought.

@Raulster24
Contributor Author

@romanlutz Makes sense, you are right. WordLevelConverter already covers this pattern (Leetspeak, Emoji), so a separate base class would duplicate it. I'll drop the CharacterSubstitutionConverter idea and keep the Arabic converters standalone as they are. Thanks for catching it.

Contributor

@romanlutz romanlutz left a comment

I reran the notebook to produce outputs. Looks great!

@romanlutz romanlutz added this pull request to the merge queue Jun 4, 2026
Merged via the queue into microsoft:main with commit 155d9af Jun 4, 2026
52 checks passed
3 participants